Downdating the Latent Semantic Indexing Model for Conceptual Information Retrieval

نویسندگان

  • Dian I. Witter
  • Michael W. Berry
چکیده

Due to the growth of large data collections, information retrieval or database searching is of vital importance. Lexical matching techniques may retrieve irrelevant or inaccurate results because of synonyms and polysemous words, so effective concept-based techniques are needed. One such technique is latent semantic indexing (LSI) which uses a vector-space approach by identifying documents whose content is related to the user’s query in order of similarity. LSI uses the singular value decomposition (SVD) of term-by-document matrix to encode the terms and documents in a vector-space model. Existing methods for removing terms or documents from the term-document space are either time consuming or do not sufficiently change the term-document relationships. This paper presents a new method for downdating, downdating the reduced model (or DRM) method, and discusses its implementation into the LSI++ software environment. The DRM method can be used to assess the effect that a term or document has on the clustering of relevant information in a collection and for the incorporation of user feedback in the existing LSI model. Implementing the DRM method within LSI++ not only provides downdating functionality, but is less time consuming than recomputing the SVD when removing a term, document or both. The DRM method is a viable algorithm for dynamic information modeling and retrieval.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Analysis of a Vector Space Model, Latent Semantic Indexing and Formal Concept Analysis for Information Retrieval

Latent Semantic Indexing (LSI), a variant of classical Vector Space Model (VSM), is an Information Retrieval (IR) model that attempts to capture the latent semantic relationship between the data items. Mathematical lattices, under the framework of Formal Concept Analysis (FCA), represent conceptual hierarchies in data and retrieve the information. However, both LSI and FCA use the data represen...

متن کامل

Japanese-Chinese Cross-Language Information Retrieval: An Interlingua Apporach

Electronically available multilingual information can be divided into two major categories: (1) alphabetic language information (English-like alphabetic languages) and (2) ideographic language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages as well as in ideographic languages (especially, in Japanese and Chinese) is growing at an i...

متن کامل

Comparison of Information Retrieval Techniques: Latent Semantic Indexing and Concept Indexing

The task of information retrieval is to extract relevant documents for a certain query from the collection of documents. As large sets of documents are now increasingly common, there is a growing need for fast and efficient information retrieval algorithms. The algorithms we are dealing with are embedded in the vector space model. In this paper we compare two information retrieval techniques: l...

متن کامل

Probabilistic Latent Semantic Indexing Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval

Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain{speci c synonymy as well as with polysemous words. In contrast ...

متن کامل

Latent Semantic Indexing for Patent Documents

Since the huge database of patent documents is continuously increasing, the issue of classifying, updating and retrieving patent documents turned into an acute necessity. Therefore, we investigate the efficiency of applying Latent Semantic Indexing, an automatic indexing method of information retrieval, to some classes of patent documents from the United States Patent Classification System. We ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Comput. J.

دوره 41  شماره 

صفحات  -

تاریخ انتشار 1998